CUDA Programming Guide: Mapping Hardware to Software: Compute Capability Versions

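Compute capability (CC) is the versioning bridge between two things: the virtual architecture (PTX) and the real architecture (SASS/binary). Developers use nvcc to target platforms ranging from desktop and server systems to embedded devices, under the data model of the host operating system: LP64 on 64-bit Linux or LLP64 on 64-bit Windows.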

1. Virtual Architecture vs. Real Architecture

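The CUDA Toolkit supports a window of recent GPU architectures, summarized in Table 29: feature support by compute capability (7.5 to 12.x). The mapping between the two architectures is set with a flag such as: nvcc --generate-code arch=compute_80,code=sm_90 prog.cu. Here arch= names the virtual architecture (PTX) and code= names the real architecture (SASS). For newer targets, use nvcc -arch=sm_100, or the architecture-specific variant nvcc -arch=sm_100a.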
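The virtual-to-real pairing in that flag can be sketched as a small helper that assembles the option string. This is only an illustration: the function name generate_code_flag and the integer encoding (80 for 8.0) are hypothetical, and the check that the real architecture is not older than the virtual one is an assumption about what nvcc accepts, not a reproduction of its validation logic.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: build an nvcc --generate-code option from a virtual
// (PTX) and a real (SASS) compute capability, each encoded as major*10 + minor
// (for example, 80 means 8.0).
inline std::string generate_code_flag(int virtual_cc, int real_cc) {
    // Assumption: the real target must not be older than the virtual
    // architecture that its code is generated from.
    if (real_cc < virtual_cc) {
        throw std::invalid_argument(
            "real architecture is older than the virtual architecture");
    }
    return "--generate-code arch=compute_" + std::to_string(virtual_cc) +
           ",code=sm_" + std::to_string(real_cc);
}
```

Calling generate_code_flag(80, 90) yields the same option string used in the command above; swapping the two arguments throws instead of producing a flag.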

2. Macro Hierarchy

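The compiler provides __CUDA_ARCH__ so that code can branch on the target architecture. The macro is defined only when device code is being compiled (for example, inside __device__ or __global__ functions). Finer-grained control comes from two further macros: __CUDA_ARCH_SPECIFIC__ and __CUDA_ARCH_FAMILY_SPECIFIC__. Some features require a minimum compute capability, either 9.0 or higher or 10.0 or higher depending on the feature; examples are distributed shared memory and certain NaN payloads.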
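A minimal sketch of branching on __CUDA_ARCH__, written so that it compiles both with nvcc and with a plain C++ compiler. The names HOST_DEVICE, ComputeCapability, decode_arch, compiled_arch, and has_distributed_shared_memory are hypothetical helpers introduced for illustration. The sketch relies on __CUDA_ARCH__ encoding the compute capability as major*100 + minor*10 (800 for compute_80), and it takes the 9.0 threshold for distributed shared memory from the text above.

```cpp
// Portable sketch: builds with nvcc (as a .cu file) or a plain C++ compiler.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__  // hypothetical convenience macro
#else
#define HOST_DEVICE
#endif

// Hypothetical type: a compute capability split into its two parts.
struct ComputeCapability {
    int major_version;
    int minor_version;
};

// Split a __CUDA_ARCH__-style value (major*100 + minor*10) into its parts.
HOST_DEVICE constexpr ComputeCapability decode_arch(int arch) {
    return ComputeCapability{arch / 100, (arch % 100) / 10};
}

// Architecture value seen by the current compilation pass. In the host pass
// __CUDA_ARCH__ is undefined, so this returns 0 there.
HOST_DEVICE inline int compiled_arch() {
#if defined(__CUDA_ARCH__)
    return __CUDA_ARCH__;
#else
    return 0;
#endif
}

// Hypothetical feature gate: treats distributed shared memory as available
// from compute capability 9.0 upward.
HOST_DEVICE constexpr bool has_distributed_shared_memory(int arch) {
    return arch >= 900;
}

static_assert(decode_arch(750).major_version == 7, "7.5 has major version 7");
static_assert(decode_arch(750).minor_version == 5, "7.5 has minor version 5");
static_assert(decode_arch(1000).major_version == 10, "10.0 has major version 10");
```

When the file is built by a host-only compiler, compiled_arch() returns 0, which makes the "defined only in device code" rule directly observable.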

3. Numerical Limits and Constraints

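Precision depends on compute capability (CC). For example, $2^{-16382} \approx 3.36 \cdot 10^{-4932}$ is the smallest positive normal value of a floating-point format with a 15-bit exponent, such as 128-bit quad precision; smaller magnitudes fall into the subnormal range. Some limits, such as CUDA_DEVICE_MAX_COPY_CONNECTIONS=16 or the .maxnreg PTX directive, are strictly enforced according to the target compute capability version.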
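The decimal approximation of $2^{-16382}$ can be checked by hand with base-10 logarithms, using $\log_{10} 2 \approx 0.30103$:

$$
\log_{10}\left(2^{-16382}\right) = -16382 \cdot \log_{10} 2 \approx -4931.4734 = -4932 + 0.5266
$$

$$
2^{-16382} \approx 10^{0.5266} \cdot 10^{-4932} \approx 3.36 \cdot 10^{-4932}
$$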

QUESTION 1

Where is the __CUDA_ARCH__ macro strictly defined?

In both host and device code to identify the GPU hardware.

Only in device code (__device__, __host__ __device__, or __global__).

Only in the host-side main() function.

In the NVML API headers only.

QUESTION 2

Which command correctly targets a virtual arch of 8.0 and a real arch of 9.0?

nvcc -arch=sm_80,code=compute_90

nvcc --generate-code arch=compute_80,code=sm_90

nvcc --target=cc_8.0_9.0

nvcc -arch=sm_90

QUESTION 3

What is the consequence of declaring 'namespace cuda { struct foo; }' in CUDA code?

It enables high-speed memory access.

It is an error; the 'cuda' namespace is reserved.

It is required for using cuda::std::result_of.

It allows the use of __nv_atomic_load.

QUESTION 4

Which mathematical property is associated with CC 9.x and higher?

Support for 16-byte data types.

Basic support for fabs(x) and sin(x).

Introduction of the warpSize macro.

Ability to use x + y in kernels.

QUESTION 5

What happens when an extended lambda is defined inside a generic lambda using Microsoft Visual Studio host compilers?

It compiles successfully and inlines perfectly.

The host compiler may fail to inline or throw an error.

It enables __managed__ memory by default.

It triggers the on-disk JIT compilation cache.